Contents: Correlation matrix; Exploratory Factor Analysis vs. Confirmatory Factor Analysis; An Example with Categorical Data in R (psych package)
Factor analysis is a useful tool for investigating variable relationships for complex concepts such as socioeconomic status, dietary patterns, or psychological scales.
It allows researchers to investigate concepts that are not easily measured directly by collapsing a large number of variables into a few interpretable underlying factors.
In factor analysis, a factor is a latent (unmeasured) variable that expresses itself through its relationship with other, measured variables.
Take, for example, a construct like cognition. We may want to measure a person’s cognition, but this is the kind of construct that is impossible to capture with a single variable: it is too abstract and multifaceted, even though it represents a single concept. Instead, you may have to develop a scale with many items, each of which measures some more concrete facet of cognition. The idea is that there is an underlying unmeasurable factor, cognition, that causes people to respond in characteristic patterns across the many items on the scale.
The purpose of factor analysis is to analyze these patterns of response as a way of getting at the underlying factor. Factor analysis also lets you combine the weighted item responses into what are called factor scores: a single score for each person on the factor. Factor scores are convenient because they let you use a single variable as a measure of the factor in other analyses, rather than a whole set of items.
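In its simplest form, the common factor model behind this idea writes each observed item as a weighted sum of the latent factors plus a unique error:
\[ x_i = \lambda_{i1} f_1 + \lambda_{i2} f_2 + \cdots + \lambda_{im} f_m + \varepsilon_i, \]
where the \(\lambda\)'s are the factor loadings estimated below and \(\varepsilon_i\) is the part of item \(i\) not shared with the factors.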
Factor analysis is most often applied to continuous variables, but nominal or ordinal variables pose no problem when using fa() in the psych library.
\(\Rightarrow\) As demonstrated below, using ordinal or nominal/binary data for factor analysis in R is no more difficult than using continuous data.
If the data include a mixture of variable types, use cor="mixed" in fa() from the psych library, as sketched below.
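A hypothetical sketch (the data frame df.mixed and its columns are invented for illustration; fa() builds the appropriate mixed correlation matrix internally):
library(psych)
# df.mixed: imagined data frame with binary, ordinal, and continuous columns
fa.mixed <- fa(df.mixed, nfactors = 2, cor = "mixed")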
Now we jump into an example: the depression data (NHANES, 2014-15) include 10 ordinal variables, listed below. Working through them helps us see how meaningful FA can be with categorical data.
- DPQ010: Have little interest in doing things
- DPQ020*: Feeling down, depressed, or hopeless
- DPQ030: Trouble sleeping or sleeping too much
- DPQ040: Feeling tired or having little energy
- DPQ050: Poor appetite or overeating
- DPQ060*: Feeling bad about yourself
- DPQ070: Trouble concentrating on things
- DPQ080: Moving or speaking slowly or too fast
- DPQ090*: Thought you would be better off dead
- DPQ100: Difficulty these problems have caused

(* marks the more critical/serious items. The dataset also contains a few more variables on different ordinal scales, such as SLD010H, SLQ050, and SLQ060, which are not used in this example.)
Each variable is coded in the same "ordinal" numeric format: the higher the value \(\Rightarrow\) the more the response suggests depression.
options(width = 300)
<- read.csv("data/dep.csv",header = TRUE,
dep sep = ",", na.strings = "NA")
# na.strings : a character vector of strings which are to be
# interpreted as NA values.
head(dep)
SEQN DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060 DPQ070 DPQ080 DPQ090 DPQ100 SLD010H SLQ050 SLQ060
1 73557 1 0 0 0 0 0 0 0 0 1 7 1 2
2 73558 2 0 0 0 0 0 0 0 0 0 9 2 2
3 73561 2 1 0 3 3 0 0 0 0 1 9 2 2
4 73562 3 3 3 3 3 1 2 1 0 3 5 2 1
5 73564 0 1 0 1 0 0 0 0 0 0 9 2 1
6 73566 0 0 1 0 0 0 0 0 0 0 6 1 2
nrow(dep)
[1] 3651
x <- dep[,2:11]
We first check our correlation matrix. A visual analysis of the correlation matrix can reveal similarities or correlations among groups of questions.
R <- cor(x)
library(corrplot)
corrplot(R,type="upper",order="hclust")
The blue structures are positively correlated groups of variables, and red structures would represent negatively correlated groups emerging in the data.
library(psych)
library(GPArotation)
head(x)
DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060 DPQ070 DPQ080 DPQ090
1 1 0 0 0 0 0 0 0 0
2 2 0 0 0 0 0 0 0 0
3 2 1 0 3 3 0 0 0 0
4 3 3 3 3 3 1 2 1 0
5 0 1 0 1 0 0 0 0 0
6 0 0 1 0 0 0 0 0 0
DPQ100
1 1
2 0
3 1
4 3
5 0
6 0
An important note for FA of different variable types using the psych library: fa(data, nfactors) has an additional argument, cor, with several options depending on the input variable types:

- cor="cor": Pearson correlation, for numeric variables on different scales
- cor="cov": covariance, for numeric variables on similar scales
- cor="tet": tetrachoric correlation, for dichotomous variables
- cor="poly": polychoric correlation, for ordinal variables
- cor="mixed": mixed correlations for a mixture of tetrachorics, polychorics, Pearsons, biserials, and polyserials, as appropriate (Yuleb is Yule-Bonett; Yuleq and YuleY are the usual Yule coefficients)

Remarks:

- You may convert nominal variables into dummy variables (see the sketch below), but that increases dimensionality very quickly.
- The polychoric/tetrachoric approach assumes each ordinal variable reflects a latent normal variable.
- Correlation between non-numeric variables is often assessed with chi-square test statistics instead.
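For instance, a minimal sketch of dummy coding a nominal variable (the variable marital here is hypothetical):
marital <- factor(c("single", "married", "widowed", "married"))
dummies <- model.matrix(~ marital - 1)   # one indicator column per category
head(dummies)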
Since our variables are all ordinal, let's use polychoric correlations:
polychoric(x)
Call: polychoric(x = x)
Polychoric correlations
DPQ01 DPQ02 DPQ03 DPQ04 DPQ05 DPQ06 DPQ07 DPQ08 DPQ09 DPQ10
DPQ010 1.00
DPQ020 0.58 1.00
DPQ030 0.31 0.33 1.00
DPQ040 0.38 0.38 0.37 1.00
DPQ050 0.38 0.43 0.34 0.34 1.00
DPQ060 0.47 0.70 0.30 0.36 0.43 1.00
DPQ070 0.47 0.51 0.34 0.35 0.41 0.51 1.00
DPQ080 0.48 0.51 0.41 0.37 0.41 0.51 0.59 1.00
DPQ090 0.40 0.65 0.34 0.37 0.37 0.70 0.49 0.49 1.00
DPQ100 0.57 0.66 0.42 0.47 0.46 0.61 0.56 0.59 0.61 1.00
with tau of
1 2 3
DPQ010 0.296 1.06 1.5
DPQ020 0.368 1.22 1.6
DPQ030 -0.096 0.74 1.1
DPQ040 -0.665 0.68 1.1
DPQ050 0.333 1.07 1.5
DPQ060 0.660 1.36 1.7
DPQ070 0.635 1.26 1.6
DPQ080 0.980 1.52 1.9
DPQ090 1.646 2.12 2.4
DPQ100 0.629 1.60 2.1
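The tau values are the estimated thresholds of the latent standard normal variable assumed to underlie each item: category \(k\) is observed when the latent score falls between \(\tau_{k-1}\) and \(\tau_k\), so \(P(X = k) = \Phi(\tau_k) - \Phi(\tau_{k-1})\), with \(\tau_0 = -\infty\) and the last threshold's upper bound at \(+\infty\). For example, DPQ090's first threshold of 1.646 implies that roughly \(\Phi(1.646) \approx 95\%\) of respondents fall in the lowest category, consistent with it being one of the most serious items.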
The output contains the polychoric correlation matrix (rho) and these thresholds (tau). We now run fa() on the ordinal variables (which have more than two values, hence cor='poly' rather than 'tet'), in this case assuming two factors:
fa.out <- fa(dep[,2:11], nfactors = 2, cor = 'poly', rotate = "oblimin")
fa.out
Factor Analysis using method = minres
Call: fa(r = dep[, 2:11], nfactors = 2, rotate = "oblimin", cor = "poly")
Standardized loadings (pattern matrix) based upon correlation matrix
MR2 MR1 h2 u2 com
DPQ010 0.54 0.15 0.44 0.56 1.1
DPQ020 0.19 0.67 0.69 0.31 1.2
DPQ030 0.69 -0.18 0.32 0.68 1.1
DPQ040 0.61 -0.06 0.32 0.68 1.0
DPQ050 0.54 0.05 0.34 0.66 1.0
DPQ060 -0.05 0.90 0.75 0.25 1.0
DPQ070 0.58 0.14 0.49 0.51 1.1
DPQ080 0.70 0.05 0.54 0.46 1.0
DPQ090 0.07 0.74 0.63 0.37 1.0
DPQ100 0.60 0.26 0.68 0.32 1.4
MR2 MR1
SS loadings 2.95 2.24
Proportion Var 0.29 0.22
Cumulative Var 0.29 0.52
Proportion Explained 0.57 0.43
Cumulative Proportion 0.57 1.00
With factor correlations of
MR2 MR1
MR2 1.00 0.78
MR1 0.78 1.00
Mean item complexity = 1.1
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 45 and the objective function was 4.72 with Chi Square of 17223.44
The degrees of freedom for the model are 26 and the objective function was 0.14
The root mean square of the residuals (RMSR) is 0.03
The df corrected root mean square of the residuals is 0.04
The harmonic number of observations is 3651 with the empirical chi square 246.89 with prob < 7.1e-38
The total number of observations was 3651 with Likelihood Chi Square = 524.04 with prob < 3.7e-94
Tucker Lewis Index of factoring reliability = 0.95
RMSEA index = 0.072 and the 90 % confidence intervals are 0.067 0.078
BIC = 310.76
Fit based upon off diagonal values = 1
Measures of factor score adequacy
MR2 MR1
Correlation of (regression) scores with factors 0.93 0.94
Multiple R square of scores with factors 0.87 0.89
Minimum correlation of possible factor scores 0.74 0.78
names(fa.out)
[1] "residual" "dof" "chi" "nh"
[5] "rms" "EPVAL" "crms" "EBIC"
[9] "ESABIC" "fit" "fit.off" "sd"
[13] "factors" "complexity" "n.obs" "objective"
[17] "criteria" "STATISTIC" "PVAL" "Call"
[21] "null.model" "null.dof" "null.chisq" "TLI"
[25] "RMSEA" "BIC" "SABIC" "r.scores"
[29] "R2" "valid" "score.cor" "weights"
[33] "rotation" "communality" "communalities" "uniquenesses"
[37] "values" "e.values" "loadings" "model"
[41] "fm" "rot.mat" "Phi" "Structure"
[45] "method" "scores" "R2.scores" "r"
[49] "np.obs" "fn" "Vaccounted"
plot(fa.out$values,type='b')
biplot(fa.out)
head(fa.out$scores)
MR2 MR1
[1,] -0.3143623 -0.3168916
[2,] -0.5140691 -0.4382790
[3,] 0.7452267 0.2120120
[4,] 3.2561279 2.2674301
[5,] -0.6919326 -0.3763604
[6,] -0.7741730 -0.6857362
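Under the hood, the default regression-method factor scores are (approximately) the standardized item responses multiplied by the score weights stored in fa.out$weights. A minimal sketch, assuming complete cases and treating the ordinal items as numeric:
Z <- scale(x)                  # standardize the 10 items
head(Z %*% fa.out$weights)     # approximately reproduces fa.out$scores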
Let’s check factor scores of subject 122, who is really depressed!
x[122,]
DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060 DPQ070 DPQ080 DPQ090
122 3 3 3 2 2 3 2 0 3
DPQ100
122 3
fa.out$scores[122,]
MR2 MR1
5.193272 6.472891
These are really large values; look at the second score especially.
Let's get the predicted values of x[122,]; they should be high.
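The reconstruction below uses the approximation \(\hat{x} \approx \bar{x} + \hat{f}^\top \Lambda^\top\), where \(\bar{x}\) is the vector of item means, \(\hat{f}\) the subject's factor scores, and \(\Lambda\) the 10-by-2 loading matrix. Note this treats the ordinal items as numeric, so it is only indicative.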
apply(x,2,mean) + fa.out$scores[122,]%*%t(as.matrix(fa.out$loadings[,1:2]))
DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060 DPQ070
[1,] 4.350776 5.852346 3.323438 3.933419 3.722472 6.000532 4.341527
DPQ080 DPQ090 DPQ100
[1,] 4.179323 5.194817 5.158928
Although the second factor is not supported by the elbow rule in the scree plot, we still include it in the analysis because of its substantive meaning.
Interpret the output as you would for numeric variables, but take care when generalizing FA results for non-numeric variables.
Interpretation:
An alternative way of doing the same FA is to pass the correlation matrix directly, which gives less output (no factor scores):
poly.R <- polychoric(x)$rho
fa.out <- fa(r = poly.R, nfactors = 2, n.obs = nrow(dep[,2:11]))
fa.diagram(fa.out, digits = 2)
The relationship of each variable to the underlying factor is expressed by the so-called factor loading. Let’s go through each part of the printed output.
round(cbind(fa.out$loadings,"h2"=fa.out$communality,"u2"=fa.out$uniquenesses,"com"=fa.out$complexity),2)
MR2 MR1 h2 u2 com
DPQ010 0.54 0.15 0.44 0.56 1.15
DPQ020 0.19 0.67 0.69 0.31 1.17
DPQ030 0.69 -0.18 0.32 0.68 1.14
DPQ040 0.61 -0.06 0.32 0.68 1.02
DPQ050 0.54 0.05 0.34 0.66 1.02
DPQ060 -0.05 0.90 0.75 0.25 1.01
DPQ070 0.58 0.14 0.49 0.51 1.11
DPQ080 0.70 0.05 0.54 0.46 1.01
DPQ090 0.07 0.74 0.63 0.37 1.02
DPQ100 0.60 0.26 0.68 0.32 1.37
What’s MR, ML, PC, etc.? These are the factors, and the name merely reflects the fitting method, e.g., minimum residual, maximum likelihood, principal components. The default is minimum residual, hence MR in this case.
Why are they ‘out of order’? The numbers assigned are arbitrary, and this has to do with the rotated solution (see the help file for more details); otherwise factors are numbered in order of variance accounted for.
h2: the amount of variance in the item/variable explained by the (retained) factors. It is the sum of the squared loadings, a.k.a. communality.
u2: 1 - h2. residual variance, a.k.a. uniqueness
com: item complexity. Specifically, it is “Hofmann’s index of complexity for each item,” computed as \((\sum_i \lambda_i^2)^2/\sum_i \lambda_i^4\), where \(\lambda_i\) is the factor loading on the \(i^{th}\) factor. Basically it tells you how much an item reflects a single construct: it equals 1 for an item loading on only one factor and grows as the item cross-loads (recomputed by hand below).
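To check this concretely, Hofmann's index can be recomputed from the loading matrix using the formula above:
L <- unclass(fa.out$loadings)
rowSums(L^2)^2 / rowSums(L^4)   # matches the 'com' column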
fa.out$Vaccounted
MR2 MR1
SS loadings 2.9498769 2.2362545
Proportion Var 0.2949877 0.2236254
Cumulative Var 0.2949877 0.5186131
Proportion Explained 0.5688010 0.4311990
Cumulative Proportion 0.5688010 1.0000000
The variance accounted for portion of the output (fa.out$Vaccounted) can be read as follows: SS loadings are the sums of squared loadings attributable to each factor; Proportion Var divides these by the number of items; Cumulative Var is the running total; and Proportion Explained is each factor's share of the total variance explained (see the sketch below for how the SS loadings can be reproduced).
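A minimal sketch, assuming psych's convention of adjusting for factor correlations in oblique solutions:
L   <- unclass(fa.out$loadings)   # pattern matrix
Phi <- fa.out$Phi                 # factor correlations
diag(Phi %*% t(L) %*% L)          # SS loadings for the oblique solution
colSums(L^2)                      # what they would be if factors were orthogonal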
The variable with the strongest association to the underlying latent variable Factor 1 is DPQ060 (Feeling bad about yourself), with a factor loading of 0.90.
Since factor loadings can be interpreted like standardized regression coefficients, one could also say that the variable DPQ060 has a correlation of 0.90 with Factor 1. This would be considered a strong association for a factor analysis in most research fields.
Two other variables, DPQ090 (Thought you would be better off dead) and DPQ020 (Feeling down, depressed, or hopeless), are also associated with Factor 1. Based on the variables loading highly onto Factor 1, we could call it “Low self-esteem individual awareness.”
DPQ010, DPQ030, DPQ040, DPQ050, DPQ070, DPQ080, and DPQ100, however, have high factor loadings on Factor 2. They seem to indicate observable signs of depression, so we may want to call Factor 2 “Physical depression signs.”
fa.out$Phi
MR2 MR1
MR2 1.0000000 0.7835837
MR1 0.7835837 1.0000000
Factor correlations (fa.out$Phi): whether you get this part of the output depends on whether the correlations are estimated at all. You need multiple factors and a rotation that allows the factors to correlate.
Model test results and fit indices (already discussed in the SEM material).
A feature of EFA is that the axes of the factors can be rotated within the multidimensional variable space. Consider what a factor analysis program does when determining the best fit between the variables and the latent factors. We have 10 variables that go into the factor analysis. The program looks first for the strongest correlations between variables and a latent factor, and makes that Factor 1; visually, one can think of it as an axis (Axis 1). The program then looks for the second set of correlations and calls it Factor 2, and so on. Sometimes the initial solution leaves a variable strongly correlated with several factors, or with none of them.
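Mechanically, rotating a two-factor solution by an angle \(\theta\) (which is what factor.rotate() does below) replaces the loading matrix \(\Lambda\) by
\[ \Lambda^{*} = \Lambda \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \]
which leaves each item's communality \(h^2\) unchanged for an orthogonal rotation.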
fa.dem <- factor.rotate(fa.out, 0, plot = TRUE, xlim = c(-1,1), ylim = c(-1,1), title = "oblimin rotation")
Rotations that allow for correlation are called oblique rotations; rotations that assume the factors are uncorrelated are called orthogonal rotations. Our graph shows an orthogonal set of axes. We ran the example above using oblimin (oblique); below, we test varimax (orthogonal):
<- fa(dep[,2:11], nfactors=2, cor='poly', rotate = "varimax")
fa.out fa.out
Factor Analysis using method = minres
Call: fa(r = dep[, 2:11], nfactors = 2, rotate = "varimax", cor = "poly")
Standardized loadings (pattern matrix) based upon correlation matrix
MR2 MR1 h2 u2 com
DPQ010 0.40 0.53 0.44 0.56 1.9
DPQ020 0.71 0.43 0.69 0.31 1.6
DPQ030 0.17 0.54 0.32 0.68 1.2
DPQ040 0.24 0.52 0.32 0.68 1.4
DPQ050 0.31 0.49 0.34 0.66 1.7
DPQ060 0.81 0.31 0.75 0.25 1.3
DPQ070 0.41 0.56 0.49 0.51 1.8
DPQ080 0.38 0.63 0.54 0.46 1.6
DPQ090 0.71 0.34 0.63 0.37 1.4
DPQ100 0.53 0.63 0.68 0.32 1.9
MR2 MR1
SS loadings 2.60 2.58
Proportion Var 0.26 0.26
Cumulative Var 0.26 0.52
Proportion Explained 0.50 0.50
Cumulative Proportion 0.50 1.00
Mean item complexity = 1.6
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 45 and the objective function was 4.72 with Chi Square of 17223.44
The degrees of freedom for the model are 26 and the objective function was 0.14
The root mean square of the residuals (RMSR) is 0.03
The df corrected root mean square of the residuals is 0.04
The harmonic number of observations is 3651 with the empirical chi square 246.89 with prob < 7.1e-38
The total number of observations was 3651 with Likelihood Chi Square = 524.04 with prob < 3.7e-94
Tucker Lewis Index of factoring reliability = 0.95
RMSEA index = 0.072 and the 90 % confidence intervals are 0.067 0.078
BIC = 310.76
Fit based upon off diagonal values = 1
Measures of factor score adequacy
MR2 MR1
Correlation of (regression) scores with factors 0.87 0.82
Multiple R square of scores with factors 0.75 0.68
Minimum correlation of possible factor scores 0.51 0.35
Sometimes the orthogonal rotation does not work out, i.e., it fails to produce a simple structure in which each variable loads highly on only one factor. In that case we can rotate the axes manually:
op <- par(mfrow = c(1,2))
cluster.plot(fa.out, xlim = c(-1,1), ylim = c(-1,1), title = "Unrotated")
f2r <- factor.rotate(fa.out, -65, plot = TRUE, xlim = c(-1,1), ylim = c(-1,1), title = "rotated -65 degrees")
par(op)   # restore the plotting layout
| | EFA | CFA |
|---|---|---|
| **What gets analyzed** | Correlation matrix (of items = indicators): only correlations among observed item responses are used; only a standardized solution is provided, so the original means and variances of the observed item responses are irrelevant | Covariance matrix (of items = indicators): means, variances, and covariances of item responses are analyzed (item response means historically have been ignored); output includes unstandardized AND standardized solutions |
| **Interpretation** | Rotation: all items load on all factors (latent traits); the goal is to pick a rotation that gives the closest approximation to "simple structure"; no way of distinguishing which latent traits are being measured from the correlations alone | Defining interpretation: CFA is theory-driven, so any structure becomes a testable hypothesis; you specify the number of latent factors and their inter-correlations, which items load on which latent factors, and any additional relationships for method/other covariance; you just need a clue, you don't have to be right (misfit is informative) |
| **Model fit** | Eye-balls and opinion: how many factors? (scree plots, interpretability); which rotation?; which items load on each factor? (arbitrary cutoff of .3-.4ish); standard errors infrequently used | Inferential tests via ML: global (and local) model fit tests; standard errors (and significance tests) of item loadings, error variances, and error covariances (all parameters); ability to test the appropriateness of model constraints or model additions via tests for change in model fit |
| **Factor scores** | Don't ever use factor scores from an EFA: factor scores are indeterminate (especially due to rotation); inconsistency in how factor models are applied to data | Factor scores can be used, but only if necessary; the best option is to test relations among latent factors directly through SEM; factor scores are less indeterminate in CFA and could be used |
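Although this post stops at EFA, a minimal CFA sketch of the two-factor structure found above could look like the following in lavaan (lavaan is not used elsewhere in this example; the factor names are invented, and the item split follows the oblimin loadings):
library(lavaan)
dep.model <- '
  signs   =~ DPQ010 + DPQ030 + DPQ040 + DPQ050 + DPQ070 + DPQ080 + DPQ100
  selfneg =~ DPQ020 + DPQ060 + DPQ090
'
# declaring the items as ordered triggers a polychoric-based (WLSMV) estimator
cfa.fit <- cfa(dep.model, data = dep, ordered = names(dep)[2:11])
summary(cfa.fit, fit.measures = TRUE, standardized = TRUE)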